{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "fodGF4u8srn7"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/splade/splade-vector-generation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/sparse/splade/splade-vector-generation.ipynb)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "7aAUyL65sx8S"
      },
      "source": [
        "# SPLADE Sparse-Dense Embedding Generation\n",
        "\n",
        "[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "yTJM3Al3srTj"
      },
      "source": [
        "## Overview\n",
        "\n",
        "[SPLADE](https://www.pinecone.io/learn/splade/) is a class of models that produce sparse embeddings. Dense embeddings are often difficult to interpret, but sparse embeddings have clearly identifiable token overlap, making sparse vector search results more interpretable. SPLADE models have been shown to consistently outperform dense models, particularly in out-of-domain settings. \n",
        "\n",
        "The following guide will show you how to construct SPLADE embeddings to use in Pinecone's sparse-dense vectors. See the [companion guide to skip embedding generation](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb)."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "6oJGEeD-Gz9K"
      },
      "source": [
        "## Prerequisites"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "nyohRGUAGz9K"
      },
      "source": [
        "We'll install the required libraries:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "eYRHcfJvtksh",
        "outputId": "c60a58df-0e6d-4815-b6e4-d06d9cf5ec45"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
            "To disable this warning, you can either:\n",
            "\t- Avoid using `tokenizers` before the fork if possible\n",
            "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
          ]
        }
      ],
      "source": [
        "!pip install -qU \\\n",
        "    pinecone-text[splade] \\\n",
        "    pandas==2.0.2 \\\n",
        "    datasets==2.12.0 \\\n",
        "    pyarrow"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "Qa-xdvhDH9mB"
      },
      "outputs": [],
      "source": [
        "from tqdm.notebook import tqdm"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "U0S_WoC_Gz9M"
      },
      "source": [
        "## Quora Dataset\n",
        "\n",
        "First We'll load the popular [Quora dataset](https://huggingface.co/datasets/quora).\n",
        "The Quora dataset is composed of question pairs, and the task is to determine if the questions are duplications of each other (have the same meaning)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "QQyOLjCNOO9L"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Found cached dataset quora (/Users/amnoncatav/.cache/huggingface/datasets/quora/default/0.0.0/36ba4cd42107f051a158016f1bea6ae3f4685c5df843529108a54e42d86c1e04)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "26539f0c087d488892e3fd62044296a6",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "import pandas as pd\n",
        "\n",
        "dataset = load_dataset(\"quora\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's look on a single record in this dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'questions': {'id': array([15, 16], dtype=int32),\n",
              "  'text': array(['How can I be a good geologist?',\n",
              "         'What should I do to be a great geologist?'], dtype=object)},\n",
              " 'is_duplicate': True}"
            ]
          },
          "execution_count": 19,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "raw_df = dataset[\"train\"].to_pandas()\n",
        "\n",
        "raw_df.iloc[7].to_dict()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see above, each record contains two questions and a binary label that indicates whether the questions duplicates of each other.\n",
        "\n",
        "Before we start to create embedding from our data, we need to process it. First, we need to convert the JSON column into distinct columns within our dataframe."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 20,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>is_duplicate</th>\n",
              "      <th>text</th>\n",
              "      <th>id</th>\n",
              "      <th>dq_id</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>False</td>\n",
              "      <td>What is the step by step guide to invest in sh...</td>\n",
              "      <td>1</td>\n",
              "      <td>2</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>False</td>\n",
              "      <td>What is the story of Kohinoor (Koh-i-Noor) Dia...</td>\n",
              "      <td>3</td>\n",
              "      <td>4</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>False</td>\n",
              "      <td>How can I increase the speed of my internet co...</td>\n",
              "      <td>5</td>\n",
              "      <td>6</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>False</td>\n",
              "      <td>Why am I mentally very lonely? How can I solve...</td>\n",
              "      <td>7</td>\n",
              "      <td>8</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>False</td>\n",
              "      <td>Which one dissolve in water quikly sugar, salt...</td>\n",
              "      <td>9</td>\n",
              "      <td>10</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   is_duplicate                                               text  id  dq_id\n",
              "0         False  What is the step by step guide to invest in sh...   1      2\n",
              "1         False  What is the story of Kohinoor (Koh-i-Noor) Dia...   3      4\n",
              "2         False  How can I increase the speed of my internet co...   5      6\n",
              "3         False  Why am I mentally very lonely? How can I solve...   7      8\n",
              "4         False  Which one dissolve in water quikly sugar, salt...   9     10"
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "raw_df[\"q1\"] = raw_df[\"questions\"].apply(lambda q: q[\"text\"][0])\n",
        "raw_df[\"q2\"] = raw_df[\"questions\"].apply(lambda q: q[\"text\"][1])\n",
        "raw_df[\"id1\"] = raw_df[\"questions\"].apply(lambda q: q[\"id\"][0])\n",
        "raw_df[\"id2\"] = raw_df[\"questions\"].apply(lambda q: q[\"id\"][1])\n",
        "\n",
        "q1_to_q2 = raw_df.copy().rename(columns={\"q1\": \"text\", \"id1\": \"id\", \"id2\": \"dq_id\"}).drop(columns=[\"questions\", \"q2\"])\n",
        "q2_to_q1 = raw_df.copy().rename(columns={\"q2\": \"text\", \"id2\": \"id\", \"id1\": \"dq_id\"}).drop(columns=[\"questions\", \"q1\"])\n",
        "flat_df = pd.concat([q1_to_q2, q2_to_q1])\n",
        "\n",
        "flat_df.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, we take a sample of the data and retaining just one row per question while storing a list of duplicated question IDs for each question."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "      <th>id</th>\n",
              "      <th>duplicated_questions</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>2003</th>\n",
              "      <td>What name will you give to a mathematics educa...</td>\n",
              "      <td>3985</td>\n",
              "      <td>[3986, 535562]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2004</th>\n",
              "      <td>What universities does Federal-Mogul recruit n...</td>\n",
              "      <td>3987</td>\n",
              "      <td>[3988, 23638]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2005</th>\n",
              "      <td>Why do people go for mba after masters in engi...</td>\n",
              "      <td>3989</td>\n",
              "      <td>[3990, 161829, 316113, 168810]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2006</th>\n",
              "      <td>How does a facebook account get hacked?</td>\n",
              "      <td>3991</td>\n",
              "      <td>[3992, 379602, 35699]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2007</th>\n",
              "      <td>Why do black people get upset when non-black r...</td>\n",
              "      <td>3993</td>\n",
              "      <td>[3994]</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                   text    id   \n",
              "2003  What name will you give to a mathematics educa...  3985  \\\n",
              "2004  What universities does Federal-Mogul recruit n...  3987   \n",
              "2005  Why do people go for mba after masters in engi...  3989   \n",
              "2006            How does a facebook account get hacked?  3991   \n",
              "2007  Why do black people get upset when non-black r...  3993   \n",
              "\n",
              "                duplicated_questions  \n",
              "2003                  [3986, 535562]  \n",
              "2004                   [3988, 23638]  \n",
              "2005  [3990, 161829, 316113, 168810]  \n",
              "2006           [3992, 379602, 35699]  \n",
              "2007                          [3994]  "
            ]
          },
          "execution_count": 21,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df = flat_df.drop_duplicates(\"id\").head(2000)\n",
        "df[\"duplicated_questions\"] = df[\"id\"].apply(lambda qid: flat_df[flat_df[\"id\"] == qid][\"dq_id\"].tolist())\n",
        "df = df.drop(columns=[\"dq_id\", \"is_duplicate\"])\n",
        "\n",
        "df.tail()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "LpQi7d6AuymO"
      },
      "source": [
        "### Sparse Embeddings with SPLADE \n",
        "\n",
        "In the following example we will use the [naver/splade-cocondenser-ensembledistil](https://huggingface.co/naver/splade-cocondenser-ensembledistil) SPLADE model.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aKeYepgbHUMP",
        "outputId": "af107fd4-bf0c-4a07-e723-5b4b734949dd"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "running on cpu\n"
          ]
        }
      ],
      "source": [
        "import torch\n",
        "\n",
        "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
        "print(f\"running on {device}\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "SPLADE is currently avialble in Hugging Face only as a BERT-based transformer. To encode text into sparse vectors, simple modifications to the transformer's output are required, as detailed in the [paper](https://arxiv.org/abs/2109.10086). For ease of use, we have encapsulated this model in the `pinecone-text` library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 209,
          "referenced_widgets": [
            "b99e2d309f1844fa9ce1ff9ccb869569",
            "a137392479b546c6a9878d3e0cd947ba",
            "5ae884981dd2419ca22a7c8bfb9af430",
            "9c4f4015f13344c1833371ddc28b6a8c",
            "038c9cae6bff4736bf8dcedca8bdf2f5",
            "ef6ef87149fa403dbe83d6cb88568866",
            "518f3a07e39d4b02a3201622e62bf94a",
            "23b50a00ca864611acb954f9fb7dec6c",
            "16b765dfbea749a1bb2fc82cb963b7c0",
            "aabf8f7a97074e21b542753e1aba5868",
            "ba3ee72ba9314d80aae1a3fbf34b0ed2",
            "e3afdae312c445dd851de6851162020f",
            "74ff9389baf547f28e9f36e652c0e2de",
            "df7566936b1c41f59ed48a9eda726d3e",
            "1d1c48e4e7984150ae5880484818aea4",
            "9541f05a5a374ace9552c5feadf20e7e",
            "cef5f5c14f1141ce9667cc9308865f0a",
            "c7e763e94a1843fda138db17500f361e",
            "11023d489e3242daa36c209ac63f5923",
            "7a1b6e6d6ec448e8877473f98ed65615",
            "772950e136cb49dc9a734ce2d13cc87c",
            "bb5a0b4fd89e4cc3be4589f86a6794e0",
            "55ae0a3ae30c4104ad5be1b8be24fa14",
            "913d407b38ab4ff8b14e81be70523fef",
            "9ecd7ffd7a9e4188ae67f61e2777e971",
            "ba98012c85b3497f971c87b0ab76cf14",
            "092252abd46b4e2f926a1d02fdd97a71",
            "c8290d1d348c43f9b6daa24a83fcca67",
            "4503cdb84a494a73ad7806edac4ef8de",
            "f352ef9235934dc3ae3d00015daed488",
            "f7c5c4d3b7694b26b4e28be23e7b63c3",
            "f4e8209fc12d48edad70f13ecb64898f",
            "7f59593d610240a0ae5288408bbe521b",
            "227afc7f1729483dacb1394a0604a19f",
            "26573e8f53b44b7a93529dedf7307bcd",
            "4ca559c1d327476fb54a5c4b45e75add",
            "e0bceaf156334b70b43fc369ad88c17f",
            "7d4adea2513743319bdc0c444991d4e7",
            "3439aee89c774afda188b843ea5baed1",
            "7c17dc7e94a6426280af5bd0b7aac4bb",
            "11e36d92d1cf4522b34338e7ed0e559e",
            "dfb0485b22324a5f88eae0c7880c8b47",
            "6f5cea26a52843c1b0bd4ed38c399930",
            "638fa8efb2f44ca784c966143b6d96be",
            "3eea4e02ac394bf1b2359ca2505dfd52",
            "e333e335e9aa41c18e60d61b05165c23",
            "95f7de0fdb4445a3b4d6685d9f06d7d3",
            "e5770287d11943b1891b68450f43fa33",
            "d9407bddca0e4c11b8a4b7890be54858",
            "1a6228c55a8642f8870dd05bd4df39d8",
            "52cd4e8693bb46edb3518ba1d5bf0815",
            "7c4509ca762c4a32b9438e0dac0d0f19",
            "5c22cd2cd2dd4e13ad9ade3326821c21",
            "749bf85fdc084029bacb948ca2231cbd",
            "aeddd7f58a914d949caa7a9a652fd020",
            "4b3f77fb54ee46f080a07517fa8df152",
            "b58de74e83a8411b81c952abf29387f1",
            "78969366a6d04427bb780393c4f59184",
            "534cf0f2035c4336b0ef0d79f112eb80",
            "27cf66b47cec44d392371c66d38e3781",
            "3ef1838f15f846f3bcbb8e5ad0ccac67",
            "7765c276bd584440bd8b05696632b729",
            "af7d8bbca6104c2897aeae171d0d6359",
            "7dacefe4e87d439dbb5cb11ffc69a26a",
            "ad515d856e5d48119edf3e397b468796",
            "163401a540874769917134db5111a933"
          ]
        },
        "id": "wDvs3xXZNoHn",
        "outputId": "f1977b5e-4ba0-4929-adc3-3d7f5216823d"
      },
      "outputs": [],
      "source": [
        "from pinecone_text.sparse import SpladeEncoder\n",
        "\n",
        "splade = SpladeEncoder(device=device)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "id": "dU7zBK4INppu"
      },
      "outputs": [],
      "source": [
        "doc = \"what is the capital of france?\"\n",
        "sparse_vector = splade.encode_documents(doc)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "WbnG4ON1Gz9N"
      },
      "source": [
        "\n",
        "### Dense Model\n",
        "\n",
        "We use the popular all-MiniLM-L6-v2 model available on Hugging Face for dense vectors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 465,
          "referenced_widgets": [
            "e1eeb0f48a4045a98d508a1d0b364372",
            "b26051982b504dc2990c2fb4fb64ebf1",
            "ff2cb73dde9d42c68c71c013a2cb78b6",
            "5601f202d6ff43daabb586f887e3cc0a",
            "341ca27d6e724a41ae7a66517b53cac8",
            "1eb3ee7b8e4a423abbb660fa8c077730",
            "0b81c8d9ef9a4f39b371ee73ee97590e",
            "f6ad54af8adf471393220e3da992fd16",
            "2375dd5bcdbd4ce48b3fe955a5b56f78",
            "50ad1f037f974bfd9a976a1e0a850f76",
            "555db6fff7ba450ea5af5c0947c3dfc7",
            "ff4a31b4404d40218f03be260e9965d2",
            "1daef9ad805b43ff9085df6ae2b50c85",
            "1735cf8006b4428bb405c936ead94e98",
            "d3fd3b31fa3a44f282431b0ed098cf14",
            "37f9a43001974bb0a3d78fd3dcd6f0d6",
            "ce94ea343f5740598e13cf7f1f57ab2e",
            "080ab15324dd41de911184fba1ff4f93",
            "3a9722ff628646c6927da0431e68c6ca",
            "bf3b841243c24ab8b9b56f62f45212f2",
            "272930a4447447afb73653dd2e980880",
            "4d4b985680904d038b76398e6fd26023",
            "a089beace66648e480625856089c47b1",
            "ea91971afcf34bbaba8ea1083581cbb6",
            "bb77ccc0047149538cb5fb6fcaf5b13c",
            "7f43eec77ffe4390bde1bcf3bd697c02",
            "44081fdd829a4976b224b77e68eeed54",
            "3f984a5593e148ddb567208c9bdde8cb",
            "6c73343b397a491f8217fc628a445405",
            "7128def8c81d4ca8a4e3efcff07b5945",
            "c1618f8f8486436e9b0ba08596ad8c77",
            "1064c3a69c2a431eb9e6a09bbfefb556",
            "e55b0361f7154df3b80b26b222417ad1",
            "1c40b9819daa4f4d87cdbfe2b5eb2bbc",
            "1aff9194632c4dfeb2e4e43b60c40984",
            "d2cb9333ebc44c33898d289fa01bda76",
            "d8b7dd9744f1404d9b67c47ec0bcdfe8",
            "579df56986684189bd858ffc9d7856d9",
            "7cf7466423054b5b8ef72102bc9482fa",
            "f71cd7aabdd44325a1b0bcce6770cede",
            "a78ada4ffd45472e9f6c328fe6823e1c",
            "0b0077d32a2840f3802fbf7c89f0c217",
            "091494edadff4f278a06da6022f42a06",
            "ef1e80bb794f412cad286561abb7c7a9",
            "98581d5391254b0b9fe75fd288f9e681",
            "a578b0bc06f74ea09ebb30d13b618e28",
            "2face4f15fa844e292f6b0c18a58425b",
            "49a84eb3a0a142d28383104353b8ba0c",
            "8e6ba67c5c5542ddbdb21d9d7a02bdb2",
            "91399fed29124a1ca43566ca53a2388b",
            "775dd0e19bd14e5388cab49abc870648",
            "ab7ee59605be45ebb9aa5404d9bbdffd",
            "700f4151f05f409e94ee5dbb3b426b2a",
            "9641e2ebed0842cd8ebbb2e760af8111",
            "06144e6119ec49afa52ef516532616bc",
            "97b9ad82ca6340c2b608c5971f6947ba",
            "5d9d4a774e7a43f8b0b190d9441bcd35",
            "d2a6984617a54e2b8f745ddcf769c50e",
            "3edaf95566894b3aa9a992a599503af4",
            "79e2de6afe4b413d91626bd62c00c44f",
            "791ee8318c2647cda8b16c9398dd35e2",
            "0773c3c6dffd42aeb52656b9d5d2614d",
            "1969e2d7eb51422087c8a59422401ecf",
            "32fa70dbe1524ccc93f4025780e053a3",
            "4b395efcd477497595dac2735363f7c5",
            "f013286d091d4bde9671f77f0cf43fb2",
            "8f9ceec2814a4578b8f8de287fbb9a23",
            "a20fc60785af48e8a63fcfaad3a61496",
            "cbd449d62c214bff9b2f8c779724f25f",
            "00769436c0d14353846256c7a9ff0ba5",
            "e8095d9c3ee046abbd6676d2c39c5ca5",
            "a8fa4e6dc6644cf59145cff97ab643d8",
            "3831bc12ea6a4ed790b4110c316c718b",
            "1dff6431d6e443c89f5981fa16ca9904",
            "7b339de50bd841c6839f3cfc7761634a",
            "486ce99b8f094fcca577b1b47adb5df5",
            "52756776d6824c2a91cdc92fbc66a6d9",
            "1b6c4498363d47a39cc0d2d7f353d619",
            "149f3c9970a241c4959f5fdd075a0de7",
            "a2ee9d5a1b1448c79fd8ba7964f35e9b",
            "c51e3e801410446b91a1488f8c239ecd",
            "a287123726384105abd94b88e218e1bd",
            "bb6fe43515db40caab0955e80a519c18",
            "20536029d92745d095b66308ea0933ea",
            "54b5775e832441adb156bde83f691ff9",
            "f2fbd7ae28df412ba6725a21416a186a",
            "76131858be0c4327be68d9f5569e0cba",
            "54b93aaba8c74381b07523bfcdc65cb1",
            "536e3381532042ea863111c612a7efc5",
            "83a158e4d47d4fdfadde91e127543484",
            "ecb6417f3aac4e7fa2f9d3d79ef07ca4",
            "a84b3cbe55ad404fb624007e4025d34b",
            "1ea5f6524f2f4c89a78c49b6d8451d4d",
            "cc4414c41b2d46f3b1137433698bd054",
            "e5738bf18c0a4ec6bdb9d4abe7f04aa8",
            "39b31075df614dca857dacc21c546f20",
            "eff9cf42a3774445a0e09a9ba549efb2",
            "1d89a0c779ea41fd8650d74de8479b72",
            "e12135edec084776b4461f0bbf4faa08",
            "c0f7b027b0af4916a598b36530390981",
            "4d03989c5a454deaaa2e3b3d95d0b7f7",
            "dae3af051fdb4001ad117692a3e02af3",
            "d259577fcae942398e7f611a815dca90",
            "e9fffb65a0a04ed5ac57d81c097334d8",
            "8025af11394d4a13a295405a703d98d7",
            "c0c86bb9189f427886f26c4a37a94341",
            "72e552f2521d4e528e90ff4219c6987c",
            "e2fbe1ee907d47e890a8b19825be8fed",
            "3861bdc54b8a47f1bd1310a22068e5d1",
            "f82817a047c2431a938df863c431cae7",
            "1a5fc980a08242cba132ea4cdc5122c0",
            "0e42ca7181e34d0e8ca84e9f14909267",
            "eb90bf31b01e4afa932ccfdfd5a92aa8",
            "04fe7b54b87f4ac0ad48a1e535649036",
            "2f42e7b3864f4f1caa7d1e3a3fd4909d",
            "dfbe45e546b54bde97c4fb62bbd72eab",
            "2807337b51bc48839e9e57fc1e7904a7",
            "3283815bf04f4a2e8af44ab62690baf9",
            "1783227a340446ddb827e87ba994ee38",
            "f75c395b31e14b3baba9cb7e22c5ae6c",
            "b4de6ef87ec3413d8ab462980a1c5a3b",
            "359be9ceb5254e26955b1cc3a37686b5",
            "7a789dc46f8f4074b769bd82928e598d",
            "f57459d1fa9749de98f7150f95d2fb73",
            "353cf9e35eb94c9fbfb294e7487bff34",
            "7aebdbcbd75d44c18a73d9545043b383",
            "4de3ed982f45490ab59a7d97050662a7",
            "7fa1c9a0b04041e1bc2f76d7cd374b14",
            "0d7bcc327cd047719e9c0b1dbad03760",
            "93104a537d2e45c2b6a9b14ae6a8175a",
            "e15e32397b0e4573b03784861a146cfa",
            "c6d310fe4f90445bb82260dab2747e17",
            "348acd5a9da347399bc6b69986962d4c",
            "d9f0859d8d55480d9000b591337f00f0",
            "b46081c0305b4c669dd0c2c179dc7b3d",
            "a507a8a56e6e46419d8ef4e12565edb4",
            "35e8acd420b04898b749181429c44f5a",
            "8a28409e8dc14af5bbdfa9b5f0dce8d2",
            "aa81824dad0846dcbbf7b715e2e2f599",
            "ecb1f20df9714cd5b6ae131149c23674",
            "6d025935522c4f4c8380dc1e0b6ad325",
            "2dbbec3d59a64ab0843418f1d3ee2693",
            "a484451972054aa981d513ad4ebefd53",
            "ba39f973eff84456afd147d968f3c890",
            "8d21a09ca1b749cb8fd95653b29f6163",
            "c6d136264d984ebbb3382756d2ce20e5",
            "7a0d405f11ca47349c1e11c3eea1a489",
            "5252c5c5cdea4ce992a719a111870c25",
            "a04712b1f0e9499dad0a56990b8fc8e5",
            "cb3cedec8abf4a2d98d2d63bbc498482",
            "bbdccbd89c824416ab3102b93c80dc3a",
            "ca9533698f824b31a832ae02260d1c80",
            "a37e4d14e4324e15873f2c11926fcd9c",
            "c69515083f6743eb86ee96b67f127973"
          ]
        },
        "id": "JTnCqvC67yRl",
        "outputId": "d3ba422c-7d91-4ae3-dc58-65a5dd318266"
      },
      "outputs": [],
      "source": [
        "from sentence_transformers import SentenceTransformer\n",
        "\n",
        "model = SentenceTransformer(\n",
        "    'sentence-transformers/all-MiniLM-L6-v2',\n",
        "    device=device\n",
        ")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "_kYY9euaGz9O"
      },
      "source": [
        "### Compute Dense & Sparse Embeddings\n",
        "\n",
        "Create BM25 sparse embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "ab9979305cb14685a84dc567f578c408",
            "9d12705669334235bc2e08b347943f6f",
            "697c811f9c1d44c5a9a57c74313a0b17",
            "9b592f9d096e4e248e96e457919c3c60",
            "a1bb235920d34480a7a4f07d5a63ff68",
            "21bc00fcae78427fab452ee5425444a9",
            "6a3c08d422eb4a9b94fbc6367b31e4cd",
            "885360bd6546486195a57f2325e5cc78",
            "89295b5fa4cc4d2d9a1ed5e3ceb66a08",
            "fcb1bdecabf447178fdd12598a872016",
            "258cbfbadf1149ef8116008316463861"
          ]
        },
        "id": "0dZAv1gU79x1",
        "outputId": "73669f3d-4383-4860-e749-6b5040ca70d9"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5f1261044aa0409f952f9b5820a8f473",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/100 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "sparse_values = []\n",
        "\n",
        "batch_size = 20\n",
        "for i in tqdm(range(0, len(df), batch_size)):\n",
        "  sparse_values += splade.encode_documents(df.iloc[i:i + batch_size][\"text\"].tolist())\n",
        "\n",
        "df['sparse_values'] = sparse_values"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "SIP62GHcGz9O"
      },
      "source": [
        "And now encode dense vector embeddings:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 49,
          "referenced_widgets": [
            "59820b3af7ef4c76b9b58c69e5645437",
            "f01c8381afc54c8fadf8c4d79acad5c9",
            "08a53f35d60c4a69a9956896dbfe61ea",
            "67d30ac9547547ff80ea1044895af638",
            "4a0811dacc534e4fbb502de3f7c93264",
            "a11c79127ce64cc2a0e25fed9409717f",
            "e36c1c5a080c41e08166e1dde61866e3",
            "5813628447694435874a7f886f6eb689",
            "5aad530d3b5b4087a001db5f33b63c69",
            "1c4e779b6e814054ba8a786fe3e2b1e5",
            "604479fb9121432995071ce9de044e28"
          ]
        },
        "id": "7LEwg_JM8N6j",
        "outputId": "fd29288a-6491-4998-e1ef-9b72ff2a83c6"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8bf86310877c4afcbef147c900e2330f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/100 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "batch_size = 20\n",
        "dense_values = []\n",
        "for i in tqdm(range(0, len(df), batch_size)):\n",
        "  dense_values += model.encode(df.iloc[i:i + batch_size][\"text\"].tolist()).tolist()\n",
        "\n",
        "df['values'] = dense_values"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "w7dkw9V-Gz9O"
      },
      "source": [
        "We organize our dataframe to align to the `pinecone-datasets` format:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "hwirAGEYFh3v"
      },
      "outputs": [],
      "source": [
        "df_result = df.copy()\n",
        "df_result[\"metadata\"] = None\n",
        "df_result[\"blob\"] = df_result[\"text\"].apply(lambda t: {\"text\": t})\n",
        "df_result = df_result.drop(columns=\"text\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>id</th>\n",
              "      <th>duplicated_questions</th>\n",
              "      <th>sparse_values</th>\n",
              "      <th>values</th>\n",
              "      <th>metadata</th>\n",
              "      <th>blob</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>[2]</td>\n",
              "      <td>{'indices': [1012, 2011, 2017, 2022, 2044, 219...</td>\n",
              "      <td>[0.06814990937709808, -0.03966414928436279, -0...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'What is the step by step guide to in...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>3</td>\n",
              "      <td>[4, 282170, 380197, 488853]</td>\n",
              "      <td>{'indices': [1005, 1006, 1007, 1011, 1012, 104...</td>\n",
              "      <td>[-0.04679807275533676, 0.15511494874954224, -0...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'What is the story of Kohinoor (Koh-i...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>5</td>\n",
              "      <td>[6]</td>\n",
              "      <td>{'indices': [2017, 2064, 2076, 2078, 2096, 211...</td>\n",
              "      <td>[-0.02832488901913166, 0.037209585309028625, -...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'How can I increase the speed of my i...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>7</td>\n",
              "      <td>[8]</td>\n",
              "      <td>{'indices': [1045, 1998, 2009, 2017, 2033, 204...</td>\n",
              "      <td>[0.06325336545705795, -0.05639313533902168, 0....</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'Why am I mentally very lonely? How c...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>9</td>\n",
              "      <td>[10, 109465, 480204]</td>\n",
              "      <td>{'indices': [1012, 1999, 2028, 2029, 2135, 224...</td>\n",
              "      <td>[-0.04876847192645073, -0.025538966059684753, ...</td>\n",
              "      <td>None</td>\n",
              "      <td>{'text': 'Which one dissolve in water quikly s...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   id         duplicated_questions   \n",
              "0   1                          [2]  \\\n",
              "1   3  [4, 282170, 380197, 488853]   \n",
              "2   5                          [6]   \n",
              "3   7                          [8]   \n",
              "4   9         [10, 109465, 480204]   \n",
              "\n",
              "                                       sparse_values   \n",
              "0  {'indices': [1012, 2011, 2017, 2022, 2044, 219...  \\\n",
              "1  {'indices': [1005, 1006, 1007, 1011, 1012, 104...   \n",
              "2  {'indices': [2017, 2064, 2076, 2078, 2096, 211...   \n",
              "3  {'indices': [1045, 1998, 2009, 2017, 2033, 204...   \n",
              "4  {'indices': [1012, 1999, 2028, 2029, 2135, 224...   \n",
              "\n",
              "                                              values metadata   \n",
              "0  [0.06814990937709808, -0.03966414928436279, -0...     None  \\\n",
              "1  [-0.04679807275533676, 0.15511494874954224, -0...     None   \n",
              "2  [-0.02832488901913166, 0.037209585309028625, -...     None   \n",
              "3  [0.06325336545705795, -0.05639313533902168, 0....     None   \n",
              "4  [-0.04876847192645073, -0.025538966059684753, ...     None   \n",
              "\n",
              "                                                blob  \n",
              "0  {'text': 'What is the step by step guide to in...  \n",
              "1  {'text': 'What is the story of Kohinoor (Koh-i...  \n",
              "2  {'text': 'How can I increase the speed of my i...  \n",
              "3  {'text': 'Why am I mentally very lonely? How c...  \n",
              "4  {'text': 'Which one dissolve in water quikly s...  "
            ]
          },
          "execution_count": 29,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df_result.head()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "lXDl8lvoGz9P"
      },
      "source": [
        "And now we have all we need to start using Pinecone vector database \ud83d\ude80\n",
        "\n",
        "For more details on that, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/pinecone/sparse/splade/splade-quora.ipynb)."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.11.2"
    },
    "vscode": {
      "interpreter": {
        "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}